Techniques for early detection of localized exposure to an agent active on a biological population

ABSTRACT

Techniques for early detection of localized exposure to an agent active on a biological population include collecting time series for each data type of multiple different data types. The data types are relevant for detecting exposure to the agent. For each data type, multiple time series are collected for corresponding multiple locations associated with the data type. Measures of anomalous conditions are generated at the locations for each of the different data types. The measures of anomalous conditions are based on the time series and a temporal model for each data type. Cluster analysis is performed on the measures of anomalous conditions to determine an estimated location, and an estimated extent, of effects from the agent. The techniques allow a surveillance system to avoid diluting the signal of a localized outbreak over too large an area or consuming excessive resources in computing replicas for a matched filter detector.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims benefit of Provisional Appln. 60/337,307, filed Dec. 4, 2001, the entire contents of which are hereby incorporated by reference as if fully set forth herein, under 35 U.S.C. §119(e). This application also claims benefit as a continuation-in-part of PCT Application Ser. No. PCT/US01/09244, filed Mar. 23, 2001, the entire contents of which are hereby incorporated by reference as if fully set forth herein, under 35 U.S.C. §120.

STATEMENT OF GOVERNMENTAL INTEREST

This invention was made with U.S. Government support under Defense Advanced Research Projects Agency (DARPA) Contract No. MDA972-96-D-0002. The U.S. Government has certain rights in the invention.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to performing surveillance on a biological population for exposure to an agent that acts on members of that population; and in particular to the early detection of localized exposure using cluster analysis on anomalous conditions determined from time series of multiple data types.

2. Description of the Related Art

The past approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not to be considered prior art to the claims in this application merely due to the presence of these approaches in this background section.

Recent history demonstrates that weapons of mass destruction can be built and deployed by almost any individual or group that has intent to cause harm or that is looking for chemical and biological agents. These weapons, banned from wartime usage, have nevertheless proliferated in third world countries. Information on the development and deployment of these weapons has become widely available on the Internet. Materials to produce some agents are also readily available. Certain biological agents pose a particularly insidious threat in that a clandestine release into a population may not be noticed during the incubation period of the resultant disease. Yet, concerning agents such as anthrax, once the symptoms are manifested it is no longer possible to treat the victim and high mortality is inevitable. Contagious agents like smallpox or the plague pose even greater threats. Such agents require early identification of an infected population in order to treat the victims and contain a potentially devastating epidemic.

Use of biological weapons therefore poses very serious issues for crisis and consequence management. Various State and local emergency management plans utilize fire, rescue, and law enforcement first responders to provide emergency assistance, to control an incident site, and to collect evidence for criminal prosecution. For clandestine bio-agent releases, the medical community may be the first to see patients present with uncommon diseases. These diseases include smallpox, plague, tularemia, anthrax, etc., and have a high mortality rate. In order to institute measures to contain disease outbreaks, public health officials must receive timely reports from agencies and health providers in their jurisdiction. Early warning is key to managing an epidemic and saving lives. However, the first indicators of a bio-terrorist event may be the onset of disease in humans and animals. And professionals from the health care community may not be able to recognize the early signs of diseases that would result from bio-terrorism. Early diagnosis of such diseases is often difficult because the diseases generate only common “flu-like” initial symptoms.

To overcome the obstacles concerning an effective early warning system, improved technology is needed. Information technology and advanced telecommunications can play a major role in improving surveillance for biological and chemical weapons of mass destruction. Information integrated from multiple sources that interface with the health care needs of a community can provide early warning for the onset of an outbreak resulting from terrorist activities. Even seemingly small advances in early warning timing could save a tremendous number of lives.

However, there are significant limitations with previous attempts at constructing early warning bio-surveillance systems. Conventional bio-surveillance focuses on categorical data collected from emergency rooms, clinics, and other healthcare facilities. The detection algorithms in these conventional systems rely on threshold crossing algorithms applied to single streams of data. Such an approach does not make optimal use of available information and cannot detect a bio-terrorist attack until sizeable numbers of infected individuals appear at healthcare facilities.

Further, conventional bio-surveillance is labor-intensive. For an early warning system to be a viable option several processes must be instituted. First, data from multiple agencies that interface with human health, animal health, and agriculture must be collected and forwarded to a central integration facility. In most systems, a human analyst is needed to review all the data received to extract indicators of a bio-terrorist event. If indicators are found, the analyst needs to assemble the knowledge to form an argument. When an argument is sufficiently mature, the analyst must originate alerts to the specific organizations that need to respond to the incident. This form of bio-surveillance requires continuous support, delays alerts and may be cost prohibitive both for the agencies supporting and analyzing the data.

A need exists therefore for an automated early warning bio-surveillance detection and alerting system. Such a system should be capable of operating continuously with minimal human intervention, and should exploit the data collection and analysis capabilities of modern information technology and advanced telecommunications.

In one recent approach for a more fully automated early warning system, described in the related PCT application cited above, data from multiple data types indicative of non-specific, flu-like responses to active agents are collected. A background is generated and subtracted from the data to form residuals. The residuals are used with a matched filter to detect exposure of a population to biologically active agents. The matched filter employs replica signals for residuals in the multiple data types based on one or more hypothetical exposure events. The replicas are compared to observed residuals to determine when a match occurs that indicates the likelihood of an actual outbreak similar to the hypothetical event at a given level of significance for a given limit on false alarms. A system based on this recent approach detects an outbreak more rapidly than other approaches that rely on a single data type.

While suitable for many purposes, and offering many advantages over prior approaches, this recent approach also suffers some disadvantages. One disadvantage is that a great deal of processing power is consumed to generate replicas for even a limited region. This consumption inhibits the use of the method over large geographic regions, such as the eastern or western United States.

Another disadvantage is that a larger area is subject to more different phenomena that contribute to variability of the observed data types and thus introduce noise that can mask indications of a localized exposure event. As a consequence, the signal-to-noise ratio (SNR) for the larger area is smaller than the SNR in a smaller area that contains the outbreak. In essence, the signal is diluted over the larger area.

Furthermore, in this recent approach, the background for a particular location is determined using a retinal banding approach that determines the average value of the data at locations around the particular location but excluding the particular location. If the signal encompasses a cluster of several neighboring locations where data are collected, the background computed using this recent approach may contain some of the signal and the computed residual may be smaller than the actual or predicted residual. This can degrade the detection of an actual localized event by the matched filter.
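The leakage of signal into the background can be illustrated with a minimal sketch. It assumes a one-dimensional row of equally spaced locations and a background taken as the mean of the immediate neighbors on each side; both are simplifications made for illustration, and the function name `neighbor_background_residuals` is hypothetical rather than part of the cited approach.

```python
def neighbor_background_residuals(values, half_width=1):
    """Residual = value minus the mean of nearby values, excluding the location itself.

    Assumption: locations lie on a line and the surrounding band consists of
    half_width neighbors on each side.
    """
    residuals = []
    for i, value in enumerate(values):
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        neighbors = [values[j] for j in range(lo, hi) if j != i]
        # With no neighbors there is no background estimate; treat the residual as zero.
        background = sum(neighbors) / len(neighbors) if neighbors else value
        residuals.append(value - background)
    return residuals


# An isolated excess of 5 at one location is fully recovered as a residual.
isolated = neighbor_background_residuals([10, 10, 10, 15, 10, 10, 10])

# The same excess spread over three adjacent locations leaks into the background,
# so the residual at the center of the cluster drops to zero.
clustered = neighbor_background_residuals([10, 10, 15, 15, 15, 10, 10])
```

In this toy case the clustered signal is the one that goes undetected at its center, which is the degradation described above.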

Based on the foregoing description, there is a clear need for an automated early warning bio-surveillance detection and alerting system that can be scaled up to cover larger areas and that does not suffer the disadvantages of the other approaches.

SUMMARY OF THE INVENTION

Techniques are provided for early detection of localized exposure to an agent active on a biological population. The techniques include collecting time series for each data type of multiple different data types. The data types are relevant for detecting exposure to the agent. For each data type, multiple time series are collected for corresponding multiple locations associated with the data type. Measures of anomalous conditions are generated at the locations for each of the different data types. The measures of anomalous conditions are based on the time series and a temporal model for each data type. Cluster analysis is performed on the measures of anomalous conditions to determine an estimated location, and an estimated extent, of effects from the agent.

In various aspects, the techniques include a method, a computer-readable medium, and a system that implement the steps described above.

The techniques allow a surveillance system to more rapidly detect an event by combining signals spread over multiple data types with information about expected characteristics of the signal in those various data types. Furthermore, the techniques allow the surveillance system to avoid diluting the signal of a localized outbreak over too large an analysis area by focusing a detector on a spatial cluster identified by cluster analysis. In addition, the techniques allow the surveillance system to avoid consuming excessive resources in computing an exposure event in multiple source detectors, such as an exposure event associated with a best matched replica in a matched filter detector, by focusing the application of the multiple source detector in the vicinity of the cluster.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

FIG. 1A is a flow chart that illustrates at a high level a method for early detection of localized exposure to an agent active on a biological population, according to an embodiment;

FIG. 1B is a block diagram that illustrates a system that implements the method of FIG. 1A, according to an embodiment;

FIG. 2 is a screen shot that illustrates a control interface for collecting data for the system of FIG. 1B during the data collection step of the method depicted in FIG. 1A, according to an embodiment;

FIG. 3A is a graph that illustrates a time series of data from one data type and expected values for the time series based on an autoregressive temporal model during the temporal modeling step of the method of FIG. 1A, according to an embodiment;

FIG. 3B is a graph that illustrates a time series of data from another data type and expected values for the time series based on a process control temporal model during the temporal modeling step of the method of FIG. 1A, according to an embodiment;

FIG. 4A is a block diagram that illustrates a spatial relationship between locations associated with time series and circular areas used to form candidate clusters during the cluster analysis step of the method of FIG. 1A, according to an embodiment;

FIG. 4B is a graph that illustrates a resulting cluster in a geographic area and the locations of time series that fall inside the cluster determined during the cluster analysis step of the method of FIG. 1A, according to an embodiment;

FIG. 4C is a graph that illustrates correct cluster detection and false cluster detection probabilities of the cluster analysis step of the method of FIG. 1A, according to an embodiment;

FIG. 5A is a graph that illustrates an example outbreak detection that results from applying the method of FIG. 1A at one date during the time series, according to an embodiment;

FIG. 5B is a graph that illustrates an example outbreak detection resulting from applying the method of FIG. 1A at a later date during the time series; and

FIG. 6 is a block diagram that illustrates a computer system upon which an embodiment of the invention may be implemented.

DETAILED DESCRIPTION

A method and apparatus for early detection of localized exposure to an agent active on a biological population are described. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.

Embodiments of the invention are described below in the context of detecting flu-like symptoms that are shared by several biological agents during early exposure stages. The data types are therefore not specific to any one of those agents. Also the data types comprise time series with a temporal resolution of one day.

However, the invention is not limited to this context. For example, in other embodiments, data types indicative of more specific symptoms of a particular biological agent may be used. Furthermore, in some embodiments data types indicative of exposure to a chemical agent, rather than a biological agent, may be used to alert responsible authorities to a chemical attack. In some embodiments, the data may be available on a finer time scale, such as reports of human health problems accumulated through a 911 emergency reporting system with time resolutions of hours or minutes.

1. Functional Overview

FIG. 1A is a flow chart that illustrates at a high level a method 100 for early detection of localized exposure to an agent active on a biological population, according to an embodiment. Although steps are shown in FIG. 1A in a particular order, for purposes of illustration, in other embodiments the steps may be performed in a different order or overlapping in time.

During step 110, time series data are collected for each of several data types. A deviation that appears in each of several data types is more likely to reflect a real exposure event than a deviation that appears in only one data type. The one data type may be subject to an alternative cause or noise that does not occur in another data type. Therefore it is considered especially useful to collect data from multiple data types in the same region. Data collection of multiple data types for an example embodiment is described in more detail below in sub-section 3.

During step 120, a temporal model is formed for each time series. A different type of temporal model may be formed for each data type. An individual temporal model of the given type is then formed for an individual time series of the associated data type by fitting parameters of the model to the data for a portion of the time series during which it is expected that no exposure event has occurred. Such a portion could be selected from a time that precedes the current time by an amount large compared to the incubation period of the agents of interest.
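As a minimal sketch of selecting such a portion, the following fits the simplest possible stand-in for a temporal model (a constant mean with a standard deviation) to a window that ends well before the current time. The function name `fit_baseline` and the parameters `gap_days` and `window_days` are hypothetical, and the constant-mean model is an assumption made for illustration rather than one of the model types described in sub-section 4.

```python
from statistics import mean, stdev


def fit_baseline(series, gap_days, window_days):
    """Fit a mean and standard deviation on a window ending gap_days before the end.

    Assumptions: series holds one value per day with the current day last, and
    gap_days is chosen to be large compared to the incubation period of interest.
    """
    end = len(series) - gap_days
    start = max(0, end - window_days)
    baseline = series[start:end]
    if len(baseline) < 2:
        raise ValueError("baseline window is too short to fit")
    return mean(baseline), stdev(baseline)


# Ten daily counts; the last three days are excluded from the fit so that a
# recent outbreak cannot contaminate the baseline.
daily_counts = [4, 6, 5, 5, 4, 6, 5, 9, 12, 15]
baseline_mean, baseline_sd = fit_baseline(daily_counts, gap_days=3, window_days=5)
```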

Forming temporal models for multiple data types for an example embodiment is described in more detail below in sub-section 4. In the example embodiment, the type of temporal model is developed once for each data type, during a research activity that may precede the collection step 110. During the collection step 110, an individual temporal model of the temporal model type is formed by fitting the portion of the time series with the appropriate model type to determine values for any parameters of the model type. In some embodiments, one or more of the model types do not have adjustable parameters that are determined by fitting a portion of the time series; and the same individual model is associated with each time series of the data type.

During step 130, an expected value is determined for the current time for each time series for all the data types. Each expected value is obtained by applying the individual, fitted temporal models to the time series preceding the current time.

In step 132, it is determined whether the actual values at the current time deviate from the expected values by more than a threshold amount. If so, then control passes to step 139 and beyond to further examine the actual and expected values for this time series (and, possibly, nearby time series) in order to detect an outbreak and determine an associated exposure event. If not, then control passes to step 140 to perform cluster analysis, described in more detail below. Cluster analysis is still warranted in that case because, for example, each of 10 adjoining zip codes may get 1 or 2 additional cases that do not look unusual to any individual temporal detector. In some embodiments, step 132 represents a step taken by a temporal detector.
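The zip code example can be made concrete with a short calculation. The numbers below are invented for illustration: each of 10 zip codes is assumed to expect 9 cases per day, counts are assumed to be independent with Poisson-like variance (variance equal to the expected count), and 2 additional cases are observed in each zip code.

```python
import math


def z_score(observed, expected):
    """Poisson-style z-score; the variance is assumed equal to the expected count."""
    return (observed - expected) / math.sqrt(expected)


expected = [9.0] * 10   # hypothetical expected daily cases per zip code
observed = [11.0] * 10  # two additional cases in each zip code

# Each zip code on its own deviates by about 0.67 standard deviations.
individual = [z_score(o, e) for o, e in zip(observed, expected)]

# Pooled over the 10 adjoining zip codes the deviation is about 2.1 standard deviations.
pooled = z_score(sum(observed), sum(expected))
```

Under these assumptions no single temporal detector would flag its own series, while the pooled deviation over the adjoining locations stands out, which is the situation the cluster analysis of step 140 is meant to catch.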

In step 139, it is determined whether deviations between actual and expected values are real and make obvious the existence of an outbreak caused by an exposure event. Any method known in the art for determining an obvious, real deviation may be used. For example, if the deviation has a size that is several standard deviations of normal variations about the expected value for the data type, and if other deviations of similar size are detected in adjacent times of the same time series or adjacent locations in other time series, or both, then the deviation may be considered both real and indicative of an outbreak. If it is determined in step 139 that the deviations are obvious and real, then control passes to step 158 to notify authorities of an exposure alert. In some embodiments, step 139 represents a step taken by a deviation-validity-check component of a surveillance system.
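One way the example check could be expressed is sketched below. The function name `is_obvious_deviation`, the default threshold of four standard deviations, and the `similar_fraction` used to decide what counts as a deviation "of similar size" are all assumptions chosen for illustration; the text leaves the method open.

```python
def is_obvious_deviation(z_current, z_adjacent_times, z_adjacent_locations,
                         threshold=4.0, similar_fraction=0.75):
    """Return True if a deviation is large and corroborated nearby.

    All arguments are deviations expressed in standard deviations of normal
    variation. threshold and similar_fraction are hypothetical tuning values.
    """
    if z_current < threshold:
        return False
    similar = similar_fraction * z_current
    corroborated_in_time = any(z >= similar for z in z_adjacent_times)
    corroborated_in_space = any(z >= similar for z in z_adjacent_locations)
    return corroborated_in_time or corroborated_in_space


# A 5-sigma deviation preceded by a 4.2-sigma deviation in the same series.
supported = is_obvious_deviation(5.0, [4.2], [])

# A 5-sigma deviation with nothing comparable nearby is not treated as obvious.
unsupported = is_obvious_deviation(5.0, [1.0], [0.5])
```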

It is expected in many cases that deviations from expected values are subtle and are not obviously the result of a real outbreak from an actual exposure event. For example, similar deviations are sometimes observed without a real outbreak from an actual exposure event. An alert based on such deviations would too often result in a false alarm. False alarm rates that are too high undermine the effectiveness of an alerting system. In such cases, control passes to step 140 and beyond to apply more sophisticated detection techniques.

In step 140, spatial cluster analysis is performed on the current deviations at the multiple locations associated with each of the multiple data types. Performing cluster analysis on multiple data types for an example embodiment is described in more detail below in sub-section 5. In some embodiments, step 140 represents a step taken by a spatial-cluster-analyzer component of a surveillance system. Any cluster analysis approach known in the art at the time the surveillance system is built may be used. In typical embodiments, the result of step 140 is a most likely cluster location, cluster spatial size (extent) and signal size (amplitude) inside the cluster, or measure of the likelihood that the cluster is real. Control then passes to step 149 and beyond to determine if the cluster analysis results indicate a real outbreak associated with an actual exposure event.
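Because any cluster analysis approach may be used, the following is only a minimal sketch of one possibility: a scan over circular candidate areas centered on each location, scoring each candidate by the sum of its deviations divided by the square root of the number of locations inside. That scoring rule, the planar coordinates, and the name `best_circular_cluster` are assumptions made for illustration and are not the statistic of sub-section 5.

```python
import math


def best_circular_cluster(locations, z_scores, radii):
    """Scan circles centered on each location and return the highest-scoring one.

    locations: list of (x, y) planar coordinates (an assumption of this sketch).
    z_scores:  parallel list of deviations in standard deviations.
    radii:     candidate circle radii in the same units as the coordinates.
    Returns (score, center_index, radius, member_indices).
    """
    best = None
    for center, (cx, cy) in enumerate(locations):
        for radius in radii:
            members = [i for i, (x, y) in enumerate(locations)
                       if math.hypot(x - cx, y - cy) <= radius]
            # Combined z-score, assuming independent unit-variance deviations.
            score = sum(z_scores[i] for i in members) / math.sqrt(len(members))
            if best is None or score > best[0]:
                best = (score, center, radius, members)
    return best


locations = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (20, 0)]
z_scores = [0.1, -0.2, 0.0, 1.5, 1.6, 0.2]
score, center, radius, members = best_circular_cluster(locations, z_scores, [1.0, 3.0])
```

In this toy data neither elevated location is remarkable alone, but the circle containing both receives the highest score, giving an estimated cluster location and extent.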

In step 149, it is determined whether the signal size is real and makes obvious the existence of an outbreak caused by an exposure event. Any method known in the art for determining an obvious, real cluster may be used. If it is determined in step 149 that the cluster amplitude indicates an obvious and real outbreak, then control passes to step 158 to notify authorities of an exposure alert. In some embodiments, step 149 represents a step taken by a cluster-validity-check component of a surveillance system.

It is expected in many cases that the cluster amplitude suggests an outbreak but does not make it obvious that a real outbreak has occurred. For example, clusters of the same amplitude are sometimes observed in the absence of an outbreak from a real exposure event, so that an alert based on such a cluster has an unacceptably high chance of being a false alarm. In such cases, control passes to step 150 and beyond to apply more sophisticated detection techniques.

In step 150, a multiple data type (“multiple source”) detector is used in the vicinity of the cluster in order to determine whether an actual exposure event near the cluster is most likely the cause of deviations from expected values. Any multiple source detector known at the time the system is built may be used. Performing detection on multiple data types for an example embodiment is described in more detail below in sub-section 6. In the embodiments described below, a multiple source, matched filter is used with the deviations to detect an exposure event. In typical embodiments, the result of step 150 is a most likely exposure event location and exposure event time and exposure event significance level. Control then passes to step 156 and beyond to send an alert if the exposure event is likely enough to be real.
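The idea of comparing observed deviations against replicas can be sketched as follows. This is a generic matched filter under the assumption of independent, equal-variance noise, not the detector of sub-section 6; the replica names and values are invented, and `matched_filter_score` and `best_replica` are hypothetical helper names.

```python
import math


def matched_filter_score(residuals, replica):
    """Correlate observed residuals with a replica, normalized by the replica norm."""
    if len(residuals) != len(replica):
        raise ValueError("residuals and replica must have the same length")
    norm = math.sqrt(sum(r * r for r in replica))
    if norm == 0:
        return 0.0
    return sum(x * r for x, r in zip(residuals, replica)) / norm


def best_replica(residuals, replicas):
    """Return (name, score) of the replica that best matches the residuals."""
    scored = {name: matched_filter_score(residuals, rep)
              for name, rep in replicas.items()}
    name = max(scored, key=scored.get)
    return name, scored[name]


# Hypothetical replicas for two candidate exposure times over five days.
replicas = {"early": [1, 2, 1, 0, 0], "late": [0, 1, 2, 1, 0]}
name, score = best_replica([0, 1, 2, 1, 0], replicas)
```

Restricting the set of replicas to hypothetical events near the detected cluster is what keeps this comparison inexpensive.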

In some embodiments, step 150 includes step 152. In step 152, an analysis region is refined and the multiple source detector is applied again. Any method of refining the analysis region from the cluster location and size may be used. In some embodiments, the cluster analysis is run again for finer spatial scale data. For example, school absenteeism data originally reported by school district is replaced by absenteeism data at individual schools in one or more school districts near the exposure event location; and cluster analysis step 140 is run again. Refining the analysis region for an example embodiment is described in more detail below in sub-section 6. In the embodiment described below, the analysis region is refined by running replicas for the matched filter at individual schools or stores, or both, near the exposure event location first computed, instead of at centroids of school districts and store accounting groups.

In step 156, it is determined whether an exposure event is detected with enough significance that false alarm rates are acceptably low. If so, then control passes to step 158 to notify authorities of an exposure alert. If not, then control passes back to step 110 to continue collecting time series data.
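The branching among the steps described in this sub-section can be summarized in a short sketch. The four boolean flags are hypothetical stand-ins for the outcomes of the checks in steps 132, 139, 149 and 156, and the alert step is taken to be step 158, the step in which the notification is sent.

```python
def trace_pass(exceeds_threshold, deviation_obvious, cluster_obvious, event_significant):
    """Return the sequence of step numbers visited in one pass through method 100."""
    steps = [110, 120, 130, 132]
    if exceeds_threshold:
        steps.append(139)
        if deviation_obvious:
            return steps + [158]  # obvious, real deviation: alert immediately
    steps += [140, 149]
    if cluster_obvious:
        return steps + [158]      # obvious, real cluster: alert
    steps += [150, 156]
    # Alert only if the multiple source detector finds a significant event;
    # otherwise return to data collection.
    steps.append(158 if event_significant else 110)
    return steps


quiet_day = trace_pass(False, False, False, False)
subtle_outbreak = trace_pass(False, False, False, True)
```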

In step 158, an exposure alert notification is sent to authorities. Any information of use to the authorities may be included in the alert. For example, the alert includes the time and location and significance of the exposure event detected by the multiple source detector and also includes the current size and extent of the outbreak as determined by the cluster analysis and exposure event.

2. Structural Overview

FIG. 1B is a block diagram that illustrates a system 160 that implements the method of FIG. 1A, according to an embodiment.

System 160 includes data structures 162 that store time series data. Any data structures for storing time series data known in the art may be used. For example, in some embodiments, data structures 162 are one or more database objects in a database system. In some embodiments, data structures are files in a file system.

A variety of data types are stored in time series data structures 162. Data types are selected to indicate population health that may be affected by exposure to the active agents. In a related application cited above, PCT Appln. Ser. No. PCT/US01/09244, filed Mar. 23, 2001 by Lombardo et al. (hereinafter “Lombardo”), a list of multiple different data types is suggested. Based on that list, the following data types are suggested as examples of different data types:

1) high school absentee data: daily absentee and total enrollment figures from public schools in one or more school districts or counties;

2) over the counter (OTC) pharmaceutical sales: sales records for the top 30 products for relief of flu symptoms from drug store chains;

3) emergency room (ER) admissions data: records for admission to hospitals in one or more counties for ER codes that are related to various symptoms of illness;

4) insurance claim billing records: records of insurance claims for insurance codes related to symptoms of illness from a state agency;

5) nursing home illness records: records of employee and resident upper respiratory illnesses from nursing homes in one or more counties; and

6) results of laboratory tests: records of influenza test results from a state health department.

In the illustrated embodiment, time series data structures 162 include time series data structures 162 a, 162 b, 162 c, 162 d and ellipsis 163 representing other time series data structures, not explicitly depicted. Data structure 162 a holds time series data based on insurance claims and ER visits for upper respiratory symptoms segregated by patient zip code. Data structure 162 b holds time series data based on insurance claims and ER visits for gastrointestinal (GI) symptoms segregated by patient zip code. Data structure 162 c holds time series data based on OTC sales counts segregated by group of stores in a catchment area or by individual store. Data structure 162 d holds time series data based on absenteeism (count or percent) segregated by school or school district or county. A location is associated with each time series. Time series data that represents an area, such as a county or zip code, is considered to occupy a location given by a representative location in the area represented, such as a centroid of the area represented. More details on collecting time series data are provided below in sub-section 3. In other embodiments, other data types are used.
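Where an area's boundary is available as a polygon, a centroid can be computed with the standard area-weighted (shoelace) formula, as in the sketch below. The function name `polygon_centroid` is hypothetical, and the sketch assumes planar coordinates and a simple, non-self-intersecting boundary; how the illustrated embodiment obtains its representative locations is not stated here.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate polygon has no area")
    # The centroid formula divides by 6 * area, which equals 3 * twice_area.
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


# A 2-by-2 square has its centroid at (1, 1).
square_centroid = polygon_centroid([(0, 0), (2, 0), (2, 2), (0, 2)])
```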

The system 160 includes multiple components called temporal detectors 164. Each temporal detector 164, as well as other components depicted in FIG. 1B, may be a separate process or part of a larger process; each component may run on a separate processor dedicated to the process or may share time on the same processor with one or more other processes.

The time series data from data structures 162 are fed into temporal detectors 164. A different temporal detector may be used for different time series. The temporal detectors 164 perform at least one of the steps 120, 130 depicted in FIG. 1A for determining the expected and actual values of the time series at the current time. In the illustrated embodiment, time series data from data structure 162 a is input to temporal detector 164 a, time series data from data structure 162 b is input to temporal detector 164 b, time series data from data structure 162 c is input to temporal detector 164 c, time series data from data structure 162 d is input to temporal detector 164 d, and time series data from data structures represented by ellipsis 163 are input to temporal detectors represented by ellipsis 165. More details on the temporal detectors are provided below in sub-section 4.

The system 160 includes components called an outlier selector 166, a validity check 168, and an alert 190. Anomalous conditions detected by one or more of temporal detectors 164, based on the expected and actual values for the current time, are input to outlier selector 166. The outlier selector 166 selects any pair of expected and actual values that represents a deviation that is unusually large, such as a deviation of four standard deviations or more. Any such pair is input to the validity check 168 to determine whether the deviation is real, or is due to noise or other error in the data. If the deviation is determined to be real, data is sent to alert component 190 to notify authorities of the deviation. The components 166, 168 perform the function of step 139 in FIG. 1A.
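The selection performed by the outlier selector 166 can be sketched as a simple partition of the expected and actual pairs. The function name `split_outliers` is hypothetical, the standard deviations are assumed to be supplied by the temporal models, and treating the deviation as two-sided (absolute value) is an assumption of this sketch.

```python
def split_outliers(pairs, sigmas, threshold=4.0):
    """Partition (expected, actual) pairs into outliers and the remainder.

    pairs:  list of (expected, actual) values for the current time.
    sigmas: parallel list of standard deviations of normal variation.
    Pairs deviating by threshold standard deviations or more are outliers; the
    remainder would be passed on to the spatial cluster analyzer.
    """
    outliers, remainder = [], []
    for (expected, actual), sigma in zip(pairs, sigmas):
        z = abs(actual - expected) / sigma
        if z >= threshold:
            outliers.append((expected, actual))
        else:
            remainder.append((expected, actual))
    return outliers, remainder


outliers, remainder = split_outliers(
    pairs=[(10, 11), (10, 19), (20, 21)],
    sigmas=[2.0, 2.0, 3.0],
)
```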

The system 160 includes a component called a spatial cluster analyzer 170. The spatial cluster analyzer 170 performs step 140 depicted in FIG. 1A for determining the most likely one or more clusters based on the expected and actual values of the time series at the current time. In the illustrated embodiment, anomalous conditions represented by expected and actual values at the current time for the multiple time series at multiple locations are input to spatial cluster analyzer 170. Data not selected by the outlier selector 166 is input to the spatial cluster analyzer 170. In some embodiments, outliers that could not be determined to be real are also input to spatial cluster analyzer 170; in other embodiments, outliers that could not be determined to be real are rejected and not used in further processing. More details on the spatial cluster analyzer 170 are provided below in sub-section 5.

The system 160 includes a second set of components called outlier selector 172 and a validity check 174. Significant clusters detected by the spatial cluster analyzer 170 are input to outlier selector 172. The outlier selector 172 selects any cluster that has an unusually large significance, such as a significance level of 0.05 or less. Any such cluster is input to the validity check 174 to determine whether the cluster is real, or is due to noise or other error in the data. If the cluster is determined to be real, data is sent to alert component 190 to notify authorities of the cluster. The components 172, 174 perform the function of step 149 in FIG. 1A.

The system 160 includes a component called a multiple source detector 180. The multiple source detector 180 performs at least one of steps 150, 156 depicted in FIG. 1A for determining an estimated location and time of an exposure event that leads to the observed cluster of anomalous conditions. One or more clusters not selected by the outlier selector 172 are input to the multiple source detector 180. In some embodiments, clusters that could not be determined to be real outbreaks are also input to multiple source detector 180; in other embodiments, clusters that could not be determined to be real are rejected and not used in further processing. If an exposure event is detected that is likely enough to be real, then data is sent to the alert component 190. More details on the multiple source detector 180 are provided below in sub-section 6.

The system 160 includes a component called event location optimizer 182. The event location optimizer 182 performs step 152 depicted in FIG. 1A for refining an analysis area for determining a modified location and time of the exposure event. More details on the event location optimizer 182 are provided below in sub-section 6.

3. Collecting Time Series Data

FIG. 2 is a screen shot 200 that illustrates a control interface for collecting data for the system of FIG. 1B during the data collection step 110 of the method 100 depicted in FIG. 1A, according to an embodiment. FIG. 2 shows the control form and the data specification form for outpatient visits as screen shot 200. According to this embodiment, time series data are stored in an ACCESS database available from Microsoft Corporation; thus, the data structures 162 are data structures in a Microsoft ACCESS database.

The screen shot 200 includes two windows 210, 220. A first window 210 isused to select the types of claims to form at least one of the timeseries to be used by the system 160. In the illustrated embodiment,window 210 is used to select insurance claims and ER visits by malesfive years of age and younger in zip code 2001, who show a fever; thedata is reported as a ratio of all claims. A second window 220 is usedto select all the time series to be used by the system 160. In theillustrated embodiment, window 220 is used to select military ER claims,two types of civilian claims (insurance and ER), OTC sales by two drugstore chains, and school absentee data in three counties.

Thus, in the illustrated embodiment, time series of several differentdata types are combined to detect an outbreak of symptoms and todetermine an exposure event that leads to the outbreak. ThisACCESS-based system allows analysts to include or exclude data sources,vary time windows separately for different data sources, censor datafrom subsets of individual providers or sub-regions, adjust thebackground computation method, and run retrospective and/or simulatedstudies.

4. Temporal Models Specific to Data Type

Temporal models used in data-type specific temporal detectors 164 of the illustrated embodiment fall into two main categories. One category of temporal models includes temporal pattern models; the other category includes process control models. In other embodiments, other temporal models or spatial models or combined models for one or more data types may be used.

4.1 Temporal Pattern Example

Temporal pattern models characterize specific features of the time series, such as a seasonal or weekly pattern. These models include general linear mixed models that predict a value at a next time based on a linear combination of observable parameters at present or past times. Models in this category include Poisson, multivariate, linear, logistic regression, and autoregressive models, all well known in the art.

FIG. 3A is a graph 300 that illustrates a time series of data from one data type and expected values for the time series based on an autoregressive temporal model during the temporal modeling step 120 of the method 100 of FIG. 1A, according to an embodiment. According to this autoregressive model, the predicted value “Y” of a time series at time “t”, represented by the symbol “Y_t”, is given by Equation 1a:

Y_t = X_t * b + V_t  (1a)

where X_t is a value of a function “X” of time at time t, b is a deterministic correction factor based on such factors as day of the week or time relative to a holiday, among others, and V_t is a deviation “V” at time t. The deviation V_t is a function of a random error term and deviations observed at several preceding times, as given by Equation 1b:

V_t = ε_t − φ_1 * V_{t−1} − φ_2 * V_{t−2} − φ_3 * V_{t−3} − . . . − φ_m * V_{t−m}  (1b)

where ε_t is normally distributed with a mean of zero and a variance of σ², and the coefficients φ are determined based on fitting the model to data that does not contain a localized exposure event, such as an accident or hostile attack. This autoregressive model is well known in the art and can be applied using commercially available software such as SAS.
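As an illustration of Equations 1a and 1b, the following minimal Python sketch computes a one-step prediction. It assumes, for illustration only, that X at time t is a small vector of indicator values and b a matching vector of coefficients; the function names, the Monday indicator, and all numeric values are hypothetical and are not fitted values from the illustrated embodiment.

```python
def ar_deviation(eps_t, past_v, phi):
    """Equation 1b: V_t = eps_t - phi_1*V_{t-1} - ... - phi_m*V_{t-m}.

    past_v[0] is V_{t-1}, past_v[1] is V_{t-2}, and so on.
    """
    return eps_t - sum(p * v for p, v in zip(phi, past_v))


def predict(x_t, b, past_v, phi):
    """Equation 1a, with the random term eps_t set to its mean of zero."""
    regression = sum(x * c for x, c in zip(x_t, b))
    return regression + ar_deviation(0.0, past_v, phi)


# Hypothetical inputs: an intercept plus a Monday indicator, AR order m = 2.
x_t = [1.0, 1.0]       # [intercept, is_monday] (assumed layout)
b = [40.0, 15.0]       # illustrative coefficients only
past_v = [4.0, -2.0]   # V_{t-1}, V_{t-2}
phi = [-0.5, -0.25]    # illustrative autoregressive coefficients
y_hat = predict(x_t, b, past_v, phi)
```

An observed count that exceeds such a prediction by more than a chosen threshold would be treated as an anomalous condition, as with point 312 in FIG. 3A.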

In an illustrated embodiment, this autoregressive model has been applied using SAS software to model time series of insurance claims indicating various symptoms (such as upper respiratory infection symptoms, lower respiratory infection symptoms, and gastro-intestinal symptoms), and OTC sales. The term X_t * b of Equation 1a has been used to correct for weekend effects, holiday effects, post-holiday effects, and seasonal effects. For data with more than 10 counts per day, the statistic R² indicates a good fit to the data.

FIG. 3A depicts a graph 300 of two curves 310, 320 representing two time series. The horizontal axis 302 is date indicated by month/day for a time interval from Nov. 25, 2000 through Feb. 13, 2001. The vertical axis 304 is the count of claims filed that report lower respiratory infection (LRI) symptoms for an analysis region in the national capital area. Curve 310 represents a time series of observations. These observations are based on actual claims with an artificial signal added after Jan. 1, 2001 to represent an exposure event on Jan. 1, 2001. Curve 320 represents a time series of predictions by the autoregressive model. The data curve 310 shows a weekly temporal pattern. There are few counts on two weekend days each week, when many offices are closed, and extra counts on Monday, when the weekend cases are added to the reports made that day. The data curve 310 also shows a seasonal temporal pattern. The counts increase in January compared to November and December.

The prediction curve 320 tracks the claims curve 310 quite well, including the weekly and seasonal patterns. However, the prediction curve is substantially below the data curve 310 for Monday peaks between dates January 4 and about January 31 when the artificial signal was effective. The asterisk marks point 312 where the data curve 310 deviates sufficiently from the prediction curve 320 to cross a threshold used to detect anomalous conditions.

4.2 Process Control Example

Process control models are used to detect small deviations in the tolerances of manufactured items. Models in this category include cumulative summation (CUSUM) and exponentially weighted moving average (EWMA) models, well known in the art.

FIG. 3B is a graph 350 that illustrates a time series of data from another data type and expected values for the time series based on a process control temporal model during the temporal modeling step 120 of the method 100 of FIG. 1A, according to an embodiment. According to this CUSUM model, a smoothed value “S” of a time series at time “t”, represented by the symbol “S_t”, is obtained from a data stream of observations “O” at one or more previous times. An example of exponential smoothing is given by Equation 2a:

S_t = ω * O_{t−1} + (1 − ω) * S_{t−1}  (2a)

where ω has a value between zero and 1. The deviations between S_t and O_t for several values of t are used to derive a root mean variance σ_t, and the normalized deviation “Z” at time t, represented by the symbol “Z_t”, is obtained using Equation 2b:

Z_t = (O_t − S_t) / σ_t  (2b)

The cumulative sums “S_H” and “S_L” are computed according to Equations 2c and 2d, respectively:

S_H = maximum of 0 and (Z_t − k) + old S_H  (2c)

S_L = maximum of 0 and (−Z_t − k) + old S_L  (2d)

The values of S_H and S_L are then compared to a threshold “h” indicating significant deviations. The values of ω, h and k, and a method for estimating σ_t, are tuned using test data to provide the earliest reliable alerts.
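The recursion of Equations 2a through 2d can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the tuned detector of the illustrated embodiment: σ is held constant rather than estimated from recent deviations, the smoothed value is initialized to the first observation, and Equations 2c and 2d are read in the conventional CUSUM sense as the maximum of zero and the updated sum. The function name and the sample data are hypothetical.

```python
def cusum_flags(obs, omega, k, h, sigma):
    """Flag times at which S_H or S_L exceeds the threshold h.

    sigma is a fixed constant here (an assumption); the text estimates it
    from recent deviations between smoothed and observed values.
    """
    s = obs[0]              # assumed initialization of the smoothed value
    s_hi = s_lo = 0.0
    flags = []
    for t in range(1, len(obs)):
        s = omega * obs[t - 1] + (1 - omega) * s   # Equation 2a
        z = (obs[t] - s) / sigma                   # Equation 2b
        s_hi = max(0.0, (z - k) + s_hi)            # Equation 2c (assumed reading)
        s_lo = max(0.0, (-z - k) + s_lo)           # Equation 2d (assumed reading)
        if s_hi > h:
            flags.append((t, "high"))
        if s_lo > h:
            flags.append((t, "low"))
    return flags


# Hypothetical daily counts with a jump at index 4.
flags = cusum_flags([10, 10, 10, 10, 20, 20], omega=0.5, k=0.5, h=1.0, sigma=2.0)
```

With these inputs the upper sum crosses the threshold on the day of the jump and stays above it on the following day.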

In the illustrated embodiment, this CUSUM method is used as a temporal model with emergency room (ER) visits, which show less drastic temporal patterns than are shown by insurance claims. When the CUSUM method was tuned to these data, the value of the threshold h was determined to be 1.

FIG. 3B depicts a graph 350 of two curves 360, 370 representing two time series. The horizontal axis 352 is date indicated by month/day for a time interval from Dec. 30, 2000 through Feb. 28, 2001. The vertical axis 354 is the count of respiratory cases in ERs for an analysis region in the national capital area. Curve 360 represents a time series of observations. These observations are based on actual cases with an artificial signal added after Jan. 20, 2001 to represent an exposure event on Jan. 20, 2001. Curve 370 represents a time series of smoothed values using Equation 2a. Point 362 marks a time when the value of S_H exceeds the threshold 1 and Point 364 marks a time when the value of S_L exceeds the threshold 1. Thus points 362 and 364 represent anomalous conditions for ER respiratory cases.

5. Cluster Analysis

Cluster analysis is a well-known technique for finding spatial concentrations in values for a single data type. For example, a method of cluster analysis is described in “A spatial scan statistic,” M. Kulldorff, Communications in Statistics: Theory and Methods, v26, 1997, pp 1481-1496, and “Spatial scan statistics: models, calculations, and applications,” by M. Kulldorff, Scan Statistics and Applications, J. Glaz, Ed., Birkhauser, Boston, 1999, pp 303-322 (hereinafter, collectively referenced as Kulldorff). Kulldorff presents a generalized spatial scan statistic which can be prepared from data of disease occurrence in a population for use in determining the location and extent of circles that enclose the most likely clusters of the disease. The generalized scan statistic is based on a pair of values: 1) an actual count for occurrences of the disease in an area; and 2) an expected value based on the population in the area and a rate of occurrence of the disease in the general population. Software (called “Satscan”) based on the cluster analysis of Kulldorff is available at the website of the National Cancer Institute.

FIG. 4A is a block diagram that illustrates a spatial relationship between locations associated with time series and circular areas used to form candidate clusters during cluster analysis step 140 of the method 100 of FIG. 1A, according to an embodiment. The analysis region 400 includes locations 402 for multiple time series of data from one or more data types. Locations 402 include locations 402a, 402b, 402c, 402d, 402e, 402f, 402g, 402h, 402i, 402j, 402k, among others, not shown. A time series of data associated with an area is represented by a centroid or other representative location for the area. A series of concentric candidate circles is constructed around each location in the analysis region 400 to determine whether a cluster might be centered on that location. Projecting the circles in a time dimension perpendicular to the analysis region 400 forms corresponding “cylinders”. In the illustrated example, candidate circles 410 are centered on location 402a. Candidate circles 410 include concentric candidate circles 410a, 410b, 410c, 410d, 410e, among others, not shown. For each candidate circle, a likelihood ratio of event counts inside a corresponding cylinder relative to the event counts in the entire region is determined, within some time and space limits. The most likely spatial cluster is then the one or more areas whose representative locations are within the circular base of the cylinder with the maximum likelihood ratio. For example, if the cylinder with the maximum likelihood ratio has base circle 410d, then the areas represented by locations 402a, 402h and 402i combine to form the most likely cluster.
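The construction of concentric candidate circles around each representative location can be sketched as follows. This is a purely spatial sketch (the time dimension of the cylinders is omitted), and it assumes one circle per distinct distance from the center to another location. The function name, the coordinates, and the radius limit are hypothetical; only the location labels echo FIG. 4A.

```python
import math


def candidate_clusters(locations, max_radius):
    """Yield (center_id, radius, member_ids) for concentric candidate circles.

    locations maps an identifier to planar (x, y) coordinates. One circle is
    produced for each distinct distance from the center, up to max_radius.
    """
    for cid, (cx, cy) in locations.items():
        dists = sorted(
            (math.hypot(x - cx, y - cy), lid)
            for lid, (x, y) in locations.items()
        )
        members = []
        for i, (d, lid) in enumerate(dists):
            if d > max_radius:
                break
            members.append(lid)
            # Emit the circle only once all locations at this radius are in it.
            next_same = i + 1 < len(dists) and dists[i + 1][0] == d
            if not next_same:
                yield cid, d, tuple(sorted(members))


# Hypothetical coordinates for four of the locations of FIG. 4A.
locs = {"402a": (0.0, 0.0), "402h": (1.0, 0.0), "402i": (0.0, 2.0), "402b": (5.0, 5.0)}
circles = list(candidate_clusters(locs, max_radius=3.0))
```

Each yielded member set is a candidate cluster for which a likelihood ratio would then be evaluated.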

According to embodiments of the present invention, unlike Kulldorff, the data at the locations 402 can be different data types. For some data types there may be no known rate of occurrence in the general population or no known underlying population. The data types may represent overlapping areas, such as counties and store catchment areas. The data types are combined in the cluster analysis by presenting both the observed value at each location and the predicted value from the temporal model. In embodiments that use software based on the Kulldorff approach, if two data types have the same centroids, or other representative locations, then one or both of the data types are associated with a different representative location so that no two locations provided to the software have the same location. Typically the different location is spatially close to the original location.
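One simple way to guarantee distinct representative locations is to nudge any repeated coordinate by a small offset, as in the sketch below. The offset value, the function name, and the site identifiers are assumptions for illustration; the text only requires that the substitute location be spatially close to the original.

```python
def separate_duplicates(sites, offset=1e-4):
    """Return a copy of sites in which no two entries share coordinates.

    sites maps a unique identifier to (x, y). A repeated coordinate is shifted
    in x by a small offset (an assumed, illustrative rule) until it is unique.
    """
    seen = set()
    result = {}
    for site_id, (x, y) in sites.items():
        while (x, y) in seen:
            x += offset
        seen.add((x, y))
        result[site_id] = (x, y)
    return result


# Hypothetical case: a zip-code centroid and a school share one location.
adjusted = separate_duplicates({"zip-A": (1.0, 2.0), "school-7": (1.0, 2.0)})
```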

Given a subdivision of a surveillance region into sub-regions, the Satscan software is designed to find one or more clusters of the sub-regions where combined data counts are most unlikely due to normal fluctuations, and designed to evaluate the significance of these clusters, e.g., by estimating how unlikely the counts in the clusters are.

Candidate clusters are formed by considering each of a family of circles centered at each of a set of grid points, often taken as the full set of sub-region centroids. A candidate cluster comprises sub-regions whose centroids lie in the associated circle. For each grid point, candidate cluster sizes range from a single sub-region up to a preset maximum fraction of the total case count N. In Satscan, a statistic called the likelihood ratio (LR) is computed for each candidate cluster, as given by Equation 3:

LR(J) = [O(J)/E(J)]^{O(J)} * {[N − O(J)]/[N − E(J)]}^{N − O(J)}  (3)

where J refers to the set of sub-regions whose centroids lie in a candidate circle, O(J) is the sum of the observed counts in the sub-regions included in J, E(J) is the sum of the expected counts in the sub-regions included in J, and N is the total number of cases in the region.
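Because the likelihood ratio of Equation 3 can become very large, it is usually convenient to work with its logarithm. The following sketch evaluates the logarithm for one candidate cluster. The function name is hypothetical, and returning zero when the observed count does not exceed the expected count is an assumed convention for ignoring deficits, not something stated above.

```python
import math


def log_likelihood_ratio(o, e, n):
    """Logarithm of Equation 3 for a candidate cluster J.

    o: observed count O(J); e: expected count E(J), with 0 < e < n;
    n: total number of cases N in the region.
    """
    if o <= e:
        return 0.0  # assumed convention: only excess counts are scored
    inside = o * math.log(o / e)
    outside = (n - o) * math.log((n - o) / (n - e)) if n > o else 0.0
    return inside + outside


# Hypothetical cluster: 20 observed against 10 expected, out of 100 cases.
llr = log_likelihood_ratio(20, 10, 100)
```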

The cluster J* with the largest value of LR over the sets J obtained from all grid centers and all radii up to a fixed limit is then the maximum likelihood cluster. Satscan determines a p-value estimate for the statistical significance of this cluster empirically by ranking the value of LR(J*) against other maximum likelihood ratios, each calculated similarly from a random sample of the N cases based on the expected spatial distribution. The p-value indicates the probability that the count is observed by chance due to normal fluctuations. Once a set of sub-regions is associated with a maximal cluster, Satscan chooses secondary clusters and assigns them significance levels from the successively remaining sub-regions.
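The empirical ranking described above can be sketched as a Monte Carlo procedure. The sketch below is not Satscan's code: to stay short it scans single sub-regions only instead of circles, and the function names, the number of replications, and the seed are assumptions for illustration.

```python
import math
import random


def max_single_region_llr(counts, expected_probs):
    """Largest log likelihood ratio over single sub-regions (a simplification;
    a full scan would consider every candidate circle)."""
    n = sum(counts)
    best = 0.0
    for c, p in zip(counts, expected_probs):
        e = n * p
        if c > e:
            llr = c * math.log(c / e)
            if c < n:
                llr += (n - c) * math.log((n - c) / (n - e))
            best = max(best, llr)
    return best


def monte_carlo_p_value(observed_max, expected_probs, n_cases, n_sims=999, seed=0):
    """Rank the observed maximum against maxima from random case allocations."""
    rng = random.Random(seed)
    regions = list(range(len(expected_probs)))
    exceed = 0
    for _ in range(n_sims):
        counts = [0] * len(regions)
        for r in rng.choices(regions, weights=expected_probs, k=n_cases):
            counts[r] += 1
        if max_single_region_llr(counts, expected_probs) >= observed_max:
            exceed += 1
    return (exceed + 1) / (n_sims + 1)
```

For example, `monte_carlo_p_value(4.4, [0.25] * 4, 100)` estimates how often four equally likely sub-regions would produce a maximum log likelihood ratio of at least 4.4 by chance alone.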

In illustrated embodiments, Satscan is adapted to work with different data types. In the conventional use of Satscan, expected values for the sub-regions are calculated from the respective populations, assuming uniform spatial incidence. However, counts from many of the different data sources are not population-based. For example, the distribution of insurance claim data depends on factors such as the distribution of eligible consumers and participating care providers and day of the week. We have derived expected counts from temporal modeling of individual sub-region counts and from recent data history. A common technique is to use the spatial distribution of counts from a baseline interval chosen long enough to represent the entire region yet recent enough to represent temporal trends.

For combining counts from multiple sources, different data types were treated as covariates so that Satscan could operate on them directly. Expected values for each source are calculated from source-specific modeling. Once expected values are computed, covariate observed and expected counts are summed and the likelihood ratio statistic is computed. This approach has been applied to multiple sources of medical data treated separately, to absentee counts from different counties normalized by county schedule, and to OTC sales from separate store chains. This approach allows the mixture of data organized by such variables as patient residence zip-code, provider location, and store or school address. When adding a new data source, a new covariate number is assigned and the new locations are appended to the aggregate file of spatial coordinates, provided only that exact coordinates are not repeated and that each zip-code or site has a unique identifying string. Expected and observed counts for the new source are then tabulated and included as covariate counts along with counts of the remaining data sources. The spatial clustering includes locations of all the various data sources.

Detailed data analysis is often desirable before a new data source is included in the surveillance clustering. Without such analysis, applying a scan statistic may produce spurious clusters that can mask the space-time interaction of interest. The general principle is to include the most “signal,” or cases of interest, with the least “noise.” Specific analysis issues are the selection of the outcome variable and the method for choosing the expected spatial distribution. Choice of an outcome variable is important in the use of diagnosis counts for clustering. For medical data, syndromic surveillance is used, e.g., monitoring counts of outpatient visits by diagnoses falling in any of several syndrome groups.

To illustrate these principles, an embodiment appropriate for a particular surveillance system is herein described. The U.S. Department of Defense Global Emerging Infections System (DoD-GEIS) has developed the Electronic Surveillance System for the Early Notification of Community-based Epidemics (ESSENCE) to enable outbreak alerting using syndromic surveillance. ESSENCE monitors over 100 primary care and emergency clinics in the National Capital Area (NCA) and collects approximately 100,000 claims per day several times daily from military treatment facilities worldwide. ESSENCE II, an extension of this system, collects both civilian and military data in the NCA, plus less specific but potentially timelier indicators, such as records of over-the-counter (OTC) remedy sales and school absenteeism. Principal objectives of ESSENCE II are the early identification, characterization, and tracking of disease outbreaks.

For the ESSENCE project, seven syndrome groups were chosen by DoD-GEIS for surveillance: respiratory, gastrointestinal, fever, dermatologic infectious, dermatologic hemorrhagic, neurologic, and coma. ESSENCE increments the count for a syndrome group each time a diagnosis code falls in the corresponding list.

The spatial and temporal behavior of the various syndrome group counts, especially during cold season, is examined to refine the syndrome groups and subgroups for more sensitive, specific clustering. To reduce noisy temporal behavior at the local level that can lead to excessive clustering, each source of data is evaluated before being included in the analysis. For example, absentee counts from a school that often skips reporting or whose counts are especially erratic would be excluded. For OTC sale data, counts are usually restricted to sales of influenza or diarrhea remedies.

5.1 Application to Real Cases

Combinations of data sources for both retrospective studies of known outbreaks and surveillance of high-profile events of concern to local public health authorities have been processed. FIG. 4B is a representative portion of an output file. FIG. 4B is a graph that illustrates a resulting cluster in a geographic area and the locations of time series that fall inside the cluster determined during the cluster analysis step 140 of the method 100 of FIG. 1A, according to an embodiment. A primary cluster has a location represented by the center of the circle 420 and an extent given by the sub-regions, which have representative locations within the circle 420. A radius of circle 420 may be used as a proxy for the extent of the cluster. The locations of time series are shown as solid symbols 422 inside the circle 420.

Different symbol shapes represent different data types. For example, school symbols, like the school symbol having a circular base 422a and triangular flag 422b, represent locations of time series of school absenteeism data type, and diamonds, like symbol 422c, represent locations of time series of pharmacy sales data type. Zip code centroids representing patient residential zip codes in medical data were not plotted to avoid a cluttered figure. Note that clusters may include sites from any combination of the included data sources.

A secondary cluster is associated with one time series at the center of the circle 430.

5.2 Simulations

In the absence of substantial disease outbreaks to demonstrate the advantage of clustering with multiple data sources, simulations are used to examine the potential advantage in the event of a localized attack. A purely spatial Monte Carlo simulation is here described as an example.

For a particular data source, for example, for counts of claims from the respiratory syndrome, expected spatial probabilities for the sub-regions (e.g., patient zip-codes) in the surveillance regions are assumed. The clusters produced using the scan statistic with many repetitions of the following procedure are examined.

1) For a set of background cases, compute a spatial case distribution with a multinomial random draw based on expected spatial probabilities.

2) For a test signal, choose an outbreak epicenter, e.g., an exposure event, in the surveillance region for each test background. Compute a signal probability distribution over the sub-regions, which decays exponentially with the distance from the epicenter. The signal is then a small number of additional cases chosen from this distribution with another multinomial draw.

3) Add the background and signal cases and find the maximum likelihood clusters with a spatial scan statistic.
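Steps 1 through 3 can be sketched in Python as shown below, stopping just before the scan statistic is applied. The function name, the planar centroid coordinates, and the decay length that sets how fast the signal probability falls off with distance are assumptions for illustration.

```python
import math
import random


def simulate_counts(expected_probs, centroids, epicenter,
                    n_background, n_signal, decay_length, rng):
    """Return per-sub-region counts of background plus signal cases."""
    regions = list(range(len(expected_probs)))
    # Step 1: multinomial draw of background cases from expected probabilities.
    background = rng.choices(regions, weights=expected_probs, k=n_background)
    # Step 2: signal probabilities decay exponentially with distance from the
    # epicenter; decay_length is an assumed tuning parameter.
    weights = [math.exp(-math.dist(c, epicenter) / decay_length) for c in centroids]
    signal = rng.choices(regions, weights=weights, k=n_signal)
    # Step 3 (first half): add background and signal cases.
    counts = [0] * len(regions)
    for r in background + signal:
        counts[r] += 1
    return counts


# Hypothetical three-sub-region example with the epicenter at the first centroid.
rng = random.Random(42)
counts = simulate_counts(
    expected_probs=[0.5, 0.3, 0.2],
    centroids=[(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)],
    epicenter=(0.0, 0.0),
    n_background=100, n_signal=10, decay_length=1.0, rng=rng,
)
```

The resulting counts would then be passed, together with the expected counts, to the spatial scan statistic to complete step 3.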

Over these clustering attempts, it is determined, for a threshold value “T”, in what fraction of all runs there is a computed cluster containing the epicenter whose scan statistic exceeds T, and in what fraction there is a computed false cluster whose scan statistic exceeds T. By varying this threshold over the values obtained for computed clusters, a curve is obtained, which is similar to a receiver operating characteristic (ROC) curve that plots the probability of finding the outbreak versus the probability of a false cluster. A graph can be plotted that illustrates correct cluster detection and false cluster detection probabilities of the cluster analysis step of the method of FIG. 1A, according to an embodiment. On the graph, the horizontal axis represents the probability of detecting a false cluster; and the vertical axis represents the probability of detecting a correct cluster that includes the epicenter.
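The threshold sweep that produces this ROC-like curve can be sketched as follows. The input layout is an assumption: each simulation run is summarized by the best scan statistic among clusters containing the epicenter and the best among false clusters, with None when no such cluster was computed. The function name and the sample values are hypothetical.

```python
def roc_like_curve(runs):
    """Return (p_false, p_correct) pairs, one per threshold T.

    runs: list of (best_correct_stat, best_false_stat) per simulation run;
    None means no cluster of that kind was computed in the run.
    """
    thresholds = sorted({s for run in runs for s in run if s is not None})
    n = len(runs)
    curve = []
    for t in thresholds:
        p_correct = sum(1 for c, _ in runs if c is not None and c > t) / n
        p_false = sum(1 for _, f in runs if f is not None and f > t) / n
        curve.append((p_false, p_correct))
    return curve


# Hypothetical summary of four simulation runs.
curve = roc_like_curve([(5.0, 1.0), (3.0, 4.0), (None, 2.0), (6.0, None)])
```

Each pair gives a horizontal coordinate (false cluster probability) and a vertical coordinate (correct cluster probability) for one threshold.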

In exemplary cases that can be graphed as described above, the number of outbreak cases is 10% of the number of background cases. A first curve and a second curve are computed by clustering with respiratory claims alone, and OTC anti-flu sales alone, respectively. A third curve is computed by clustering with both data sources. For reasonable detection probabilities, a substantial gain is evident when the sources are combined. For example, at a correct cluster detection probability of 0.6, a false cluster detection probability is about 0.5 using one data type (first or second curves) and about 0.2 using both data types (third curve), a reduction of false clusters by a factor of about 2.5 if both data types are combined.

This technique has several applications. It may be used to assess the marginal surveillance value of a single data source or to check for robustness of the clustering method as the spatial case distribution evolves. It may also be used to compare the performance of the likelihood ratio statistic used in Satscan to other possible scan statistics used in other embodiments, including methods based on contingency tables.

In the illustrated embodiment, disparate sources are treated as covariates whose counts and expected values are summed for the log likelihood ratio. In other embodiments other approaches may be taken. For example, the likelihood ratios may be computed separately for disparate sources and then their logarithms may be summed. Preliminary tests suggest that this statistic can prevent one noisy source from masking a signal in another; however, this statistic may lose power to detect a faint signal with traces in all sources. In other embodiments the counts and expected values are weighted by weights determined in any of several ways known in the art, or the counts are normalized by variance of the data in the data type.
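The two ways of combining sources can be contrasted with a short sketch. Each source is summarized, for one candidate cluster, by its observed count in the cluster, its expected count in the cluster, and its total case count. Using the sum of per-source totals as N in the summed statistic is an assumption, as are the function names and the numeric values.

```python
import math


def llr(o, e, n):
    """Logarithm of the likelihood ratio of Equation 3; zero when o <= e
    (an assumed convention for ignoring deficits)."""
    if o <= e:
        return 0.0
    outside = (n - o) * math.log((n - o) / (n - e)) if n > o else 0.0
    return o * math.log(o / e) + outside


def summed_covariate_llr(sources):
    """Sum observed and expected counts over sources, then score once."""
    o = sum(s[0] for s in sources)
    e = sum(s[1] for s in sources)
    n = sum(s[2] for s in sources)
    return llr(o, e, n)


def per_source_llr_sum(sources):
    """Score each source separately and sum the logarithms."""
    return sum(llr(o, e, n) for o, e, n in sources)


# Hypothetical cluster: one source with a clear excess, one below expectation.
sources = [(30, 10, 100), (8, 10, 100)]
combined = summed_covariate_llr(sources)
separate = per_source_llr_sum(sources)
```

In this example the source that falls below expectation dilutes the summed statistic but leaves the per-source sum unchanged, which is consistent with the masking behavior noted above.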

It is expected that, using these techniques, increases in early outbreak alerting capability can be achieved as the number of data sources and promptness of data reporting increase.

6. Matched Filter Detector Using Multiple Data Types

Clusters identified by Satscan or by the modified methods described above should be understood as approximate locations of concentrated data counts that may indicate an outbreak of disease. The statistical significance and persistence of these clusters should be used to evaluate their importance. They are also valuable as cues for and corroboration of other surveillance measures, such as multi-source matched filters described in Lombardo.

As described in Lombardo, replica time series are generated in the appropriate data type for one or more locations based on modeling the effects of one or more hypothetical exposure events (epicenters). According to some embodiments, the hypothetical exposure events are centered at or near the center of the cluster, and replica time series are generated for locations inside the cluster where data are available. Time-domain covariance techniques are applied to seek a likely match between the replica and the data at time of the matching exposure event. The hypothetical event that produces the most likely match is taken as the most likely event. If the significance of the match is high enough, authorities are alerted. The alert includes at least some of the time and location of the most likely event and the significance of the match and, perhaps, the location and extent of the cluster.
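The idea of matching a replica against observed data in the time domain can be illustrated with a sliding normalized covariance. This sketch is only a stand-in: the details of the matched filter of Lombardo are not reproduced here, and the function names, the single-location setup, and the sample series are assumptions for illustration.

```python
import math


def normalized_covariance(window, replica):
    """Covariance of two equal-length series scaled to the range [-1, 1]."""
    n = len(replica)
    mean_w = sum(window) / n
    mean_r = sum(replica) / n
    cov = sum((w - mean_w) * (r - mean_r) for w, r in zip(window, replica))
    norm_w = math.sqrt(sum((w - mean_w) ** 2 for w in window))
    norm_r = math.sqrt(sum((r - mean_r) ** 2 for r in replica))
    if norm_w == 0.0 or norm_r == 0.0:
        return 0.0
    return cov / (norm_w * norm_r)


def best_match(series, replica):
    """Slide the replica along the series; return (start index, best score)."""
    best_start, best_score = None, -math.inf
    for start in range(len(series) - len(replica) + 1):
        window = series[start:start + len(replica)]
        score = normalized_covariance(window, replica)
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Hypothetical anomaly series at one location and a hypothetical replica.
series = [0, 0, 1, 0, 0, 1, 3, 2, 1, 0]
replica = [1, 3, 2, 1]
start, score = best_match(series, replica)
```

In a multi-source setting, scores of this kind would be computed for each hypothetical exposure event over the locations inside the cluster, and the event with the most likely match would be retained.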

By confining the matched filter detector to areas in the vicinity of the cluster, substantial computational resources and response time are saved. This aids in obtaining the earliest possible detection of an exposure event.

In some embodiments, the most likely event is used to refine the analysis area, and the matched filter is reapplied. For example, computed relative risks of individual subregions in or near the candidate cluster may be used to exclude or annex subregions to obtain the next cluster candidate, subject to spatial restrictions. In another embodiment, if the best match is not at a time series location at the center of the cluster, where the exposure event is located, then a new exposure event, and associated replicas, are generated, centered on the area that gave the best match in the previous round.

In some embodiments, obtaining a cluster of low significance level is used to focus attention on the cluster. Authorities may be advised that an outbreak is possible and more analysis is required. In some embodiments, the data used to define a maximum likelihood cluster are reviewed and time series of marginal quality are dropped out, and the cluster analysis is run again.

Thus the analysis area and extent are refined to more precisely locate and time the event and obtain significant matches.

FIG. 5A is a graph 500 that illustrates an example outbreak detection that results from applying the method of FIG. 1A at one date during the time series, according to an embodiment. Graph 500 shows a map of sub-regions and two areas where significant outbreaks are detected as of Jan. 18, 2001. One outbreak, indicated by box 510, is associated with 11 cases in 7 days; another outbreak, indicated by box 520, is associated with 20 cases in 11 days. The probability that such outbreaks would be caused by random errors of normal variability is less than 0.001 in both cases; thus the outbreaks are highly significant. These data are from a retrospective study where an epidemiologist indicated that a scarlet fever outbreak had occurred. Our outcome variable in each time series was the number of cases of diagnosis code 034, scarlet fever, or 034.1, strep throat due to scarlet fever. Such cases are relatively rare, so case counts were compared to the population-based incidence.

FIG. 5B is a graph 550 that illustrates an example outbreak detection resulting from applying the method of FIG. 1A at a later date during the time series. Graph 550 shows a map of the same sub-regions as shown in FIG. 5A but as of Jan. 26, 2001. Three areas with significant outbreaks are detected as of Jan. 26, 2001. One outbreak, indicated by box 560, is associated with 10 cases in 5 days; another outbreak, indicated by box 570, is associated with 15 cases in 12 days; another outbreak, indicated by box 580, is associated with 11 cases in 7 days. The probability that such outbreaks would be caused by random errors of normal variability is less than 0.001 for box 580, about 0.002 for box 570, and about 0.013 for box 560; thus the outbreaks are significant.

7. Hardware Overview

FIG. 6 is a block diagram that illustrates a computer system 600 upon which an embodiment of the invention may be implemented. Computer system 600 includes a communication mechanism such as a bus 610 for passing information between other internal and external components of the computer system 600. Information is represented as physical signals of a measurable phenomenon, typically electric voltages, but including, in other embodiments, such phenomena as magnetic, electromagnetic, pressure, chemical, molecular and atomic interactions. For example, north and south magnetic fields, or a zero and non-zero electric voltage, represent two states (0, 1) of a binary digit (bit). A sequence of binary digits constitutes digital data that is used to represent a number or code for a character. A bus 610 includes many parallel conductors of information so that information is transferred quickly among devices coupled to the bus 610. One or more processors 602 for processing information are coupled with the bus 610. A processor 602 performs a set of operations on information. The set of operations includes bringing information in from the bus 610 and placing information on the bus 610. The set of operations also typically includes comparing two or more units of information, shifting positions of units of information, and combining two or more units of information, such as by addition or multiplication. A sequence of operations to be executed by the processor 602 constitutes computer instructions.

Computer system 600 also includes a memory 604 coupled to bus 610. The memory 604, such as a random access memory (RAM) or other dynamic storage device, stores information including computer instructions. Dynamic memory allows information stored therein to be changed by the computer system 600. RAM allows a unit of information stored at a location called a memory address to be stored and retrieved independently of information at neighboring addresses. The memory 604 is also used by the processor 602 to store temporary values during execution of computer instructions. The computer system 600 also includes a read only memory (ROM) 606 or other static storage device coupled to the bus 610 for storing static information, including instructions, that is not changed by the computer system 600. Also coupled to bus 610 is a non-volatile (persistent) storage device 608, such as a magnetic disk or optical disk, for storing information, including instructions, that persists even when the computer system 600 is turned off or otherwise loses power.

Information, including instructions, is provided to the bus 610 for use by the processor from an external input device 612, such as a keyboard containing alphanumeric keys operated by a human user, or a sensor. A sensor detects conditions in its vicinity and transforms those detections into signals compatible with the signals used to represent information in computer system 600. Other external devices coupled to bus 610, used primarily for interacting with humans, include a display device 614, such as a cathode ray tube (CRT) or a liquid crystal display (LCD), for presenting images, and a pointing device 616, such as a mouse or a trackball or cursor direction keys, for controlling a position of a small cursor image presented on the display 614 and issuing commands associated with graphical elements presented on the display 614.

In the illustrated embodiment, special purpose hardware, such as an application specific integrated circuit (IC) 620, is coupled to bus 610. The special purpose hardware is configured to perform operations not performed by processor 602 quickly enough for special purposes. Examples of application specific ICs include graphics accelerator cards for generating images for display 614, cryptographic boards for encrypting and decrypting messages sent over a network, speech recognition, and interfaces to special external devices, such as robotic arms and medical scanning equipment that repeatedly perform some complex sequence of operations that are more efficiently implemented in hardware.

Computer system 600 also includes one or more instances of a communications interface 670 coupled to bus 610. Communication interface 670 provides a two-way communication coupling to a variety of external devices that operate with their own processors, such as printers, scanners and external disks. In general the coupling is with a network link 678 that is connected to a local network 680 to which a variety of external devices with their own processors are connected. For example, communication interface 670 may be a parallel port or a serial port or a universal serial bus (USB) port on a personal computer. In some embodiments, communications interface 670 is an integrated services digital network (ISDN) card or a digital subscriber line (DSL) card or a telephone modem that provides an information communication connection to a corresponding type of telephone line. In some embodiments, a communication interface 670 is a cable modem that converts signals on bus 610 into signals for a communication connection over a coaxial cable or into optical signals for a communication connection over a fiber optic cable. As another example, communications interface 670 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN, such as Ethernet. Wireless links may also be implemented. For wireless links, the communications interface 670 sends and receives electrical, acoustic or electromagnetic signals, including infrared and optical signals, that carry information streams, such as digital data. Such signals are examples of carrier waves.

The term computer-readable medium is used herein to refer to any medium that participates in providing instructions to processor 602 for execution. Such a medium may take many forms, including, but not limited to, non-volatile media, volatile media and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as storage device 608. Volatile media include, for example, dynamic memory 604. Transmission media include, for example, coaxial cables, copper wire, fiber optic cables, and waves that travel through space without wires or cables, such as acoustic waves and electromagnetic waves, including radio, optical and infrared waves. Signals that are transmitted over transmission media are herein called carrier waves.

Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, a magnetic tape, or any other magnetic medium, a compact disk ROM (CD-ROM), or any other optical medium, punch cards, paper tape, or any other physical medium with patterns of holes, a RAM, a programmable ROM (PROM), an erasable PROM (EPROM), a FLASH-EPROM, or any other memory chip or cartridge, a carrier wave, or any other medium from which a computer can read.

Network link 678 typically provides information communication through one or more networks to other devices that use or process the information. For example, network link 678 may provide a connection through local network 680 to a host computer 682 or to equipment 684 operated by an Internet Service Provider (ISP). ISP equipment 684 in turn provides data communication services through the public, world-wide packet-switching communication network of networks now commonly referred to as the Internet 690. A computer called a server 692 connected to the Internet provides a service in response to information received over the Internet. For example, server 692 provides information representing video data for presentation at display 614.

The invention is related to the use of computer system 600 for implementing the techniques described herein. According to one embodiment of the invention, those techniques are performed by computer system 600 in response to processor 602 executing one or more sequences of one or more instructions contained in memory 604. Such instructions, also called software and program code, may be read into memory 604 from another computer-readable medium such as storage device 608. Execution of the sequences of instructions contained in memory 604 causes processor 602 to perform the method steps described herein. In alternative embodiments, hardware, such as application specific integrated circuit 620, may be used in place of or in combination with software to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.

The signals transmitted over network link 678 and other networks through communications interface 670, which carry information to and from computer system 600, are exemplary forms of carrier waves. Computer system 600 can send and receive information, including program code, through the networks 680, 690 among others, through network link 678 and communications interface 670. In an example using the Internet 690, a server 692 transmits program code for a particular application, requested by a message sent from computer 600, through Internet 690, ISP equipment 684, local network 680 and communications interface 670. The received code may be executed by processor 602 as it is received, or may be stored in storage device 608 or other non-volatile storage for later execution, or both. In this manner, computer system 600 may obtain application program code in the form of a carrier wave.

Various forms of computer readable media may be involved in carrying one or more sequences of instructions or data or both to processor 602 for execution. For example, instructions and data may initially be carried on a magnetic disk of a remote computer such as host 682. The remote computer loads the instructions and data into its dynamic memory and sends the instructions and data over a telephone line using a modem. A modem local to the computer system 600 receives the instructions and data on a telephone line and uses an infrared transmitter to convert the instructions and data to an infrared signal, a carrier wave serving as the network link 678. An infrared detector serving as communications interface 670 receives the instructions and data carried in the infrared signal and places information representing the instructions and data onto bus 610. Bus 610 carries the information to memory 604 from which processor 602 retrieves and executes the instructions using some of the data sent with the instructions. The instructions and data received in memory 604 may optionally be stored on storage device 608, either before or after execution by the processor 602.

In the foregoing specification, the invention has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

1. A system for early detection of localized exposure to an agent active on a human population, comprising: a processor; and a computer readable medium carrying one or more sequences of instructions which, when executed by the processor, cause the processor to carry out the steps of: collecting, for each data type of a plurality of different data types relevant for detecting exposure to the agent, a plurality of time series of data at a corresponding plurality of locations associated with the data type, wherein each different data type is an indicator of human population health that may be affected by exposure to the agent; generating measures of anomalous conditions of human population health, each indicative of the localized exposure to the agent, at the plurality of locations for each of the plurality of different data types based on the plurality of time series and a temporal model for each data type; performing cluster analysis on the measures of anomalous conditions to determine an estimated location and estimated spatial size of effects from the agent around the estimated location; generating a replica of anomalous conditions, resulting from a hypothetical exposure to the agent, for a particular location within the estimated spatial size of effects determined during said step of performing cluster analysis by modeling a hypothetical exposure event that is based on at least one of the estimated location and the spatial size of the effects determined during said step of performing cluster analysis; matching the replica to the measures of anomalous conditions for the particular location to determine whether the measures of anomalous conditions indicate an actual exposure event similar to the hypothetical exposure event; producing at least one of a modified estimated location and a modified estimated spatial size of effects from the agent based on a result of said step of matching the replica; generating a modified replica of anomalous conditions for a second particular location within the modified estimated spatial size by modeling a modified hypothetical exposure event that is based on at least one of the modified estimated location and the modified spatial size of the effects; matching the replica to the measures of anomalous conditions for the second particular location to determine whether the measures of anomalous conditions indicate an actual exposure event similar to the modified hypothetical exposure event; and if it is determined an actual exposure event has occurred, then sending, via a computer interface, an alert signal that indicates a likely time, a likely location and a spatial size of the actual exposure.
2. The system of claim 1, wherein: said step of generating measures of anomalous conditions further comprises determining a particular temporal model for a particular data type of the plurality of data types by performing auto-regression on a portion of a time series of data for the particular data type.
3. The system of claim 1, wherein: said step of generating measures of anomalous conditions further comprises determining a particular temporal model for a particular data type of the plurality of data types by performing a cumulative summation process control analysis on a portion of a time series of data for the particular data type.
4. The system of claim 1, wherein: said step of generating measures of anomalous conditions further comprises: determining an expected value for a particular data type at a particular time based on a particular temporal model for the particular data type; and generating a measure of anomalous conditions based on the expected value and an actual value for the particular data type at the particular time; and said step of performing cluster analysis further comprises comparing a first ratio of the actual value for a first data type divided by the expected value for the first data type at a first location with a second ratio of the actual value for a second data type divided by the expected value for the second data type at a second location.
5. The system as recited in claim 1, wherein: the first data type and the second data type are the same; and the first location and the second location are different.
6. The system as recited in claim 1, wherein the first data type and the second data type are different.
7. The system as recited in claim 4, wherein the data types include at least one of: over the counter drug sales at a drug store; absenteeism at a school; number of medical insurance claim forms or physician office visits filed in an area; and number of cases in categories of symptoms at a hospital or health clinic.
8. The system of claim 1, wherein said step of performing cluster analysis further comprises constructing a circle, having a radius representative of the spatial size, around the plurality of locations corresponding to the plurality of time series of data.
9. The system of claim 1, wherein the data types include at least one of: over the counter drug sales at a drug store; absenteeism at a school; number of medical insurance claim forms or physician office visits filed in an area; and number of cases in categories of symptoms at a hospital or health clinic.
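
By way of illustration only, the following Python sketch shows one way the operations recited in claims 3, 4 and 8 might be expressed in program code. It is a minimal sketch under stated assumptions, not the claimed implementation: the function names, the CUSUM reference value k, the use of a baseline mean as the expected value, the centroid-based circle, and all sample counts and coordinates are hypothetical and do not appear in the specification.

```python
import math


def cusum_scores(series, baseline_len, k=0.5):
    """One-sided upper CUSUM of standardized deviations from a baseline mean.

    Assumption: the first `baseline_len` samples represent normal conditions,
    and k (in standard deviations) is a hypothetical reference value.
    """
    baseline = series[:baseline_len]
    mean = sum(baseline) / len(baseline)
    variance = sum((x - mean) ** 2 for x in baseline) / len(baseline)
    std = math.sqrt(variance) or 1.0  # fall back to 1.0 on a flat baseline
    running, scores = 0.0, []
    for value in series[baseline_len:]:
        # Accumulate excess above the expected level; never drop below zero.
        running = max(0.0, running + (value - mean) / std - k)
        scores.append(running)
    return mean, scores


def actual_to_expected_ratio(actual, expected):
    """Ratio of an actual value to its expected value, as compared in claim 4."""
    return actual / expected


def covering_circle(locations):
    """Circle centered on the centroid that covers every location.

    Assumption: planar (x, y) coordinates; this circle is not guaranteed
    to be the smallest enclosing circle.
    """
    cx = sum(x for x, _ in locations) / len(locations)
    cy = sum(y for _, y in locations) / len(locations)
    radius = max(math.hypot(x - cx, y - cy) for x, y in locations)
    return (cx, cy), radius


# Hypothetical daily over-the-counter sales counts at one drug store.
drug_sales = [8, 12, 8, 12, 16, 16, 10]
expected, scores = cusum_scores(drug_sales, baseline_len=4)
ratio = actual_to_expected_ratio(drug_sales[4], expected)

# Hypothetical coordinates of locations with anomalous measures.
center, radius = covering_circle([(0, 0), (6, 0), (0, 8), (6, 8)])
```

In this sketch the ratio for one data type at one location could be compared with the corresponding ratio for another data type or location, and the circle radius stands in for the estimated spatial size of effects; an actual embodiment may use a different temporal model and a different clustering procedure.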